Abstract

Background: The 'database search problem', that is, the strengthening of a case - in terms of probative value - 
against an individual who is found as a result of a database search, has been approached during the last two decades 
with substantial mathematical analyses, accompanied by lively debate and centrally opposing conclusions. This 
represents a challenging obstacle in teaching but also hinders a balanced and coherent discussion of the topic within 
the wider scientific and legal community. This paper revisits and tracks the associated mathematical analyses in terms 
of Bayesian networks. Their derivation and discussion for capturing probabilistic arguments that explain the database 
search problem are outlined in detail. The resulting Bayesian networks offer a distinct view on the main debated 
issues, along with further clarity. 

Methods: As a general framework for representing and analyzing formal arguments in probabilistic reasoning about 
uncertain target propositions (that is, whether or not a given individual is the source of a crime stain), this paper relies 
on graphical probability models, in particular, Bayesian networks. This graphical probability modeling approach is 
used to capture, within a single model, a series of key variables, such as the number of individuals in a database, the 
size of the population of potential crime stain sources, and the rarity of the corresponding analytical characteristics in 
a relevant population. 

Results: This paper demonstrates the feasibility of deriving Bayesian network structures for analyzing, representing, 
and tracking the database search problem. The output of the proposed models can be shown to agree with existing 
but exclusively formulaic approaches. 

Conclusions: The proposed Bayesian networks allow one to capture and analyze the currently most well-supported 
but reputedly counter-intuitive and difficult solution to the database search problem in a way that goes beyond the 
traditional, purely formulaic expressions. The method's graphical environment, along with its computational and 
probabilistic architectures, represents a rich package that offers analysts and discussants additional modes of
interaction, concise representation, and coherent communication. 
Background

The emergence of DNA databases from a legal point of view

DNA is widely held as a category of forensic trace material
that outperforms other forensically relevant material on
parameters such as reliability. This is reflected by opinions
maintained by both members of the general public [...]
expressions such as 'silver bullet' [1], 'the most powerful
innovation in forensics since fingerprinting' [2], or 'a
perfect piece of evidence' [3]. Databases represent a transient
topic in that respect. Historically, modern DNA analyses were
first used as an investigative tool in an English criminal
case in 1986, when Colin Pitchfork was pros[...] - only after
considerable resources and time had been spent. At the time,
DNA clearly lacked the element that gives it the formidable
investigative capacities it has today: databases.

The first DNA profile databases were established during
the 1990s. Since then, all major Western countries have
enacted laws allowing the establishment of DNA profile 
databases, but the exact conditions under which they 
function vary from one jurisdiction to another. Besides, 
they are still accompanied by or cause democratic debate 
as to whose DNA profile should be taken and kept regis- 
tered. While databases may be seen as a natural byproduct 
of DNA typing, they now are used daily without many 
lawyers or even scientists devoting in-depth thought to 
the way a search through a database could influence the 
value of the DNA evidence itself. Forensic academics,
though, have been struggling for at least a decade over the
meaning of a match found through 'trawling a database'
versus situations where suspects were found through
other investigative means (that is, without the use of
a database).

The outcomes of this debate, at times conducted rather
controversially, are approached in this article from the
distinct perspective of a graphical approach. As a principal
aim, the discussion will focus on explaining how the use of a
database impacts the value assigned to a 'match' between
the profile of a trace found on the scene of a crime and 
the profile of a suspect. This question appears to have no 
intuitively obvious answer, and it may seem overly tech- 
nical to lawyers and other legal academics, but, as further 
emphasized in due course, it is in their interest to under- 
stand the challenges raised by DNA databases in terms of 
formal and argumentative interpretation procedures and 
the impact that this may have on their area of activity. 

This pairs with the more general tendency that the use 
of databases has fundamentally changed the way forensic 
evidence is currently processed, to the extent that, con- 
trary to more traditional modes of proof, the judiciary 
tends to lose control over a whole part of the administra- 
tion of the evidence [4]. So to speak, and as a matter of 
fact, a database can be viewed as a 'closed box' because its 
actual inner workings remain unknown not only to most 
defense lawyers, but also to many representatives of the 
judiciary, namely prosecutors, judges, and juries. Besides
the challenge of interpreting the probative value of
so-called 'database hits', the way in which a database is
managed, the way in which the correctness of typing results
and registrations is controlled, and the way databases are
used for calculating so-called 'rarity statistics' are all
topics that remain largely outside the control of judicial
actors. This is problematic because it may lead to
unawareness that such questions could be debated and that
the probative value of matches reported to legal actors is
intrinsically linked to such issues.



From a more general point of view, questioning the 
inferential assessment of database search results is a sub- 
ject all the more relevant because databases are growing 
continuously larger. With more people being registered 
every year, database searching of DNA profiles from traces 
of unknown origin involves comparisons with increas- 
ingly larger stocks of data. This motivates investigation 
of the knowledge, perception, and understanding of this 
situation, along with its practical implications in judicial 
proceedings. In the UK, for example, about 5% of the
population have had their profile taken and entered into
the national DNA database, which not only comprises
profiles from convicted and serious offenders, but also
from people implicated in minor cases. Yet, the probability
of finding a correspondence with an individual who is
not the true source is not equal to zero. With the potential
for adventitious matches, each database member thus runs
a real risk of facing a charge based on a 'database hit'. For
these reasons, questions that emanate from the use made
of matches derived from database searches, as well as the
assessment of their evidential value, are crucial and a topic
of ongoing interest to the legal community.

The legal perspective to interpretation of forensic evidence 

Assessing the evidential significance of results of database 
searches may appear as a marginal or exotic topic, but it 
is useful to consider it as part of scientific evidence
interpretation in the broader context of legal proceedings.
In Western countries, whether of an adversarial or of an
inquisitorial tradition, this condenses to a number of core
principles, even though distinct sets of legal rules govern
the various jurisdictions. These principles
cover, first, the requirement that only reliable evidence is
admissible. Second, except in certain rare cases, the law
does not assign a particular or predetermined value to a
given item of evidence. Even if, in practice, the word of an
expert witness testifying as to the meaning of a reported
match might carry some weight, it always remains the
judge's (or the jury's) responsibility to set and retain, in
fine, the probative value. To evaluate the reliability and
value of a given piece of evidence, the decision maker is 
said to be free. This concept of freedom actually refers to 
the ancient modes of proof, when the law would set a hier- 
archy of the different types of evidence, from the strongest 
to the weakest (with confessions being traditionally the 
strongest piece of evidence). It would also set out rules 
as to the relative weight of certain types of evidence. For 
instance, the testimony of a man was twice as reliable as 
the testimony of a woman [5]. Judges had no real power to 
evaluate cases; their only duty was to count the items of 
evidence presented by each party and declare the prevail- 
ing side. Freedom of assessment thus only means that the 
law does not assign weight to different types of evidence. 
It does not imply that judges or juries are completely free 
and can decide according to their temporary states of
mind, that is, their mere mood. In fact, the law requires 
decision makers to proceed in a rational way, so as to avoid 
unfair or arbitrary decisions. 

This raises the question of what is meant by the notion 
of rationality in the context of the interpretation of foren- 
sic evidence. There is widespread agreement, supported 
by substantive argument, on the view that judges or juries 
should follow the rules of logic and of common scien- 
tific knowledge and that Bayesian reasoning provides a 
coherent framework to conform with this requirement 
[6-8]. This approach - of which Bayesian networks are a
schematic illustration and retained as such in this paper
- assists decision makers in their assessment of situations
in the light of new pieces of evidence, but it does not, in
itself, instruct its user about the actual probative value
that ought to be given to, for instance, a DNA match.
Rather, once a match has been reported, it defines the
general rules according to which one's beliefs should
evolve in view of the uncertain target propositions, such
as whether or not a given suspect is
the source of a stain found on the crime scene. Applying
Bayesian inference in a particular situation requires one to
specify a model. This will be the main topic of discussion
pursued in the section 'The 'island' problem' and in later
parts of this paper.

Evidential value of 'database hits': two decades of debate 

'What is the strength of the evidence against a suspect 
who is found as a result of the search in a database?' 
This practical question, also sometimes referred to as 
'the database search problem', has led to considerable dis- 
cussion within the scientific community, including both 
forensic scientists and legal practitioners. Its implica- 
tions in the practice of criminal proceedings span a wide 
range. The debate was led essentially in the context of 
DNA evidence, but the underlying principle of searching 
databases containing analytical characteristics that serve 
as a basis for comparative forensic examinations applies 
also to other kinds or categories of scientific evidence 
[9]. Although this problem is strongly rooted in prac- 
tical applications, deciding on an appropriate approach 
to deal with this inference problem requires coherent 
methodological developments. 

Different answers, pointing in quite contrary directions,
have been offered so far, each accompanied by substantial
mathematics. It is not this paper's intention to
retrace this debate in all its respects nor to oppose
competing approaches. As a starting point, it suffices to note
that the prevalent and most well-supported viewpoint is
that a database search tends to strengthen a case against
a 'matching' suspect [10-18]. This paper seeks to analyze
and discuss the probabilistic tenets on which this standpoint
is founded by invoking a methodology based on
graphical probability models (that is, Bayesian networks).
Some work in this direction has already been presented
in [19,20]. A more recent paper also relied on Bayesian
networks [21], but its main focus was on a slightly different
aspect, that is, the probability of false convictions.
This paper will concentrate on the more restricted topic
of how to infer the source of a crime stain. As will be seen,
a graphical approach using Bayesian networks allows one to
demonstrate a logic that is in line with existing literature
on this topic.

Structure of the paper 

This paper is organized as follows. The 'Methods' section 
starts by providing general information about Bayesian 
networks and explains the rationale behind their use as a 
methodology in the study reported here. As an introduc- 
tory example and an initial finding, "The 'island' problem" 
section presents a Bayesian network approach for the 
well-known 'island problem'. This is a generic setting in 
which no database is involved [22]. The discussion thus 
seeks to introduce the graphical structure of probabilistic 
reasoning about the source of a crime stain in a situation 
where the use of a database is not an issue. This start- 
ing point is chosen in order to illustrate the logic of the 
extended argument that is - in later parts of the paper 
- developed for situations in which the profiles of some 
of the islanders are placed in a searchable database. This
makes it possible to point out the logical connection between
these two evaluative scenarios. As will be seen, there are
structural analogies between the two analyses, and this gives
further credit to the proposed solution for the database 
setting. In particular, it will be possible to show that the 
approach to the database search problem is merely a log- 
ical extension of the undisputed probabilistic solution to 
the island problem. In addition, the graphical interface of 
Bayesian networks will be shown to provide a clear, yet 
intuitively convincing explanation for an increase of the 
probability of the proposition according to which a match- 
ing suspect is the source of the crime stain, once other 
members of the same database are excluded (because they 
are found to present non-matching profiles). 

The section 'When some islanders are in a database' will 
introduce the database search setting more formally. The 
analyses pursued at that point focus on a stepwise presen- 
tation of settings with well-defined numbers of individuals 
for the size of the database as well as the pool of poten- 
tial crime stain donors. This aims at pointing out the 
rationale underlying the conclusion in basic cases. This is 
thought to further the understanding of solutions in sce- 
narios that extend to more general situations presented 
later in the same section. The section entitled 'A Bayesian 
network-guided derivation of the database search likeli- 
hood ratio' will reuse the previously introduced Bayesian 
network in order to point out that the proposed model 
can also serve the purpose of illustrating the derivation
of a likelihood ratio. This aspect is introduced because
the previous sections mainly focused on the calculation
of posterior probabilities for main propositions (for example,
'the suspect is the source of the crime stain'). The
merit of a Bayesian network-guided analysis for both pos- 
terior probabilities and likelihood ratios is discussed in the 
'Discussion and conclusions' section, along with general 
conclusions. Throughout the paper, the level of techni- 
cality for notation and calculation does not exceed that 
which is generally employed in existing legal literature on 
the topic, for example [18], but readers who wish to avoid 
the derivation of the mathematical background in order 
to concentrate on the proposed Bayesian networks may 
focus directly on the following sections: 'Bayesian network 
for the island problem,' 'Bayesian network for a database 
search setting: suspect and one other individual in the 
database,' 'Bayesian network for a search of a database of 
size n > 2,' and 'Discussion and conclusions'. 

Methods 

Preliminaries 

In the early 1980s, Bayesian networks were developed
in the field of artificial intelligence as an approach
that helps to apply the theory of probability to inference
problems of more substantive size and, thus, to more
realistic and practical problems [23]. Since then, Bayesian
networks have also attracted researchers in legal sciences, 
and this tendency has considerably intensified through- 
out the last decade [24]. Aitken and coauthors [25,26], for 
example, investigated the potential of Bayesian networks 
for specific case analysis, also known as 'offender profiling'. 
Based on a dataset covering the details of several hundred 
cases of sexually motivated child murders and abduc- 
tions (that is, incidents reported in Great Britain since 
1960), the authors propose different graphical models to 
relate the key parameters of a case. These models may 
be used to revise the probability of offender characteris- 
tics, given the information about the victim and the crime. 
More recently, the use of Bayesian networks has also been 
reported for crime risk factor analysis [27] as well as for 
terrorism risk management [28]. Within forensic science, 
they now constitute a major direction of research [20]. 
Beyond legal applications, such as the modeling of historical
causes célèbres [29-32], Bayesian networks are
used in virtually any field that needs to deal with inference
under circumstances of uncertainty (for example, medical
diagnosis, engineering).

Methodology 

In this paper, a Bayesian network approach is proposed 
because it allows one to point out the logic underlying 
current probabilistic analyses of the database search problem
in various ways. Making these arguments plain is
relevant not only for teaching, but also for supporting
discussion within the scientific community. There is a need
for this essentially because the developments based on 
formulae alone may not be found easy to apprehend by 
all participants within a discussion. Yet, agreement on 
such evaluative matters is essential in order to assure that 
the forensic community can take a credible stance with 
respect to recipients of expert information, in particu- 
lar, legal decision makers (such as magistrates or courts 
of law). Moreover, there are also recent recommenda- 
tions from professional bodies, for example [33], that 
diverge from the prevalent viewpoint stated above. This 
is a cause of concern and illustrates the continuing need 
for formalisms that provide support in analyzing and 
communicating probabilistic approaches [21]. 

Results and discussion 

The 'island' problem 

General description and notation 

Consider a biological stain found on a crime scene. It has
been typed and found to have the genetic profile $G_c$. It
is assumed here that the method applied for determining
the genetic profile of a biological sample works perfectly
accurately. The 'island' on which the crime was committed
has a population of size N. Initially, there is no information
that directs suspicion to any of the N islanders. Thus,
all of them are believed equally likely to be the source of the
crime stain. Since the stain is found to be of type $G_c$, so
must be the person from whom the stain comes. A suspect
comes to police attention and his blood is analyzed. He is
found to have the genetic profile $G_s$. It corresponds to that
observed for the crime stain: $G_c = G_s$. On the basis of this
information, the question of interest is as follows: 'How
convinced should one be that the suspect is the source of
the crime stain?'

In order to approach this question, information about
the occurrence of the corresponding genetic profile is
needed. Let us suppose that, on the basis of a survey
of a comparable population on another island, the target
profile can be taken to occur in about 1% of the population
and that this rate, written as $\gamma$ for short, can also
be retained for the population of the island on which
the crime stain of interest was found. It is also supposed
here that knowledge of the suspect's genotype, $G_s$, does
not affect one's probability that another islander has that
profile.

The formal analysis of this inference problem requires
some further notation. Within the population of N individuals,
let us index the suspect as person 1 and the
remaining individuals as 2, ..., N. Next, let the proposition
that a given person $i$ is the source of the crime stain be
denoted as $H_i$. The term $H_1$ thus stands for the proposition
that the suspect is the source of the crime stain.
Analogously, the propositions according to which one of
the remaining N - 1 people is the source of the crime stain
are denoted as $H_2, \ldots, H_N$. Throughout this paper,
propositions will be abbreviated with capital letters, whereas
probability assignments will be written shorthand by
Greek symbols.

The initial probability that a given individual is the
source of the crime stain will be written as $\Pr(H_i) = \pi_i$.
Since it is considered, as a starting point, that each of the N
persons could be the source with equal probability, one
has $\pi_i = 1/N$ and $\sum_{i=1}^{N} \pi_i = 1$. In later sections, further
notation is introduced in order to allow for the possibility
that some of the N individuals are part of a database.

Probability that the suspect is the source of the crime stain 

In the setting considered at this point, the suspect is
the only typed individual among the N persons. Let us
write $M_1$ for the finding that his genotype, $G_s$,
corresponds to that of the crime stain, $G_c$. The probability
that the suspect is the source of the crime stain is then given
by Bayes' theorem for discrete evidence and multiple discrete
propositions:

$$\Pr(H_1 \mid M_1) = \frac{\Pr(M_1 \mid H_1)\Pr(H_1)}{\Pr(M_1 \mid H_1)\Pr(H_1) + \sum_{i=2}^{N} \Pr(M_1 \mid H_i)\Pr(H_i)} \tag{1}$$

Here, the conditional probability of the evidence $M_1$
given $H_i$ is also called the likelihood of the proposition
given the evidence, sometimes written as $L_i$.
Equation 1 can thus be given in a more compact form:

$$\Pr(H_1 \mid M_1) = \frac{L_1 \pi_1}{L_1 \pi_1 + \sum_{i=2}^{N} L_i \pi_i} \tag{2}$$

The likelihood for any person $i$ other than the suspect,
that is, the conditional probability of the observed
correspondence given that some person other than the suspect
is the source of the crime stain, depends on the occurrence
of the corresponding features in the population:
$\Pr(M_1 \mid H_i) = L_i = \gamma$, for $i \neq 1$. Moreover, the probability that
some person other than the suspect is the source of the
crime stain is the complement of the probability that the
suspect is the source. Therefore, $\sum_{i=2}^{N} \pi_i = 1 - \pi_1$. The
term $\sum_{i=2}^{N} L_i \pi_i$ can thus be rewritten as follows:

$$\sum_{i=2}^{N} L_i \pi_i = \sum_{i=2}^{N} \gamma \pi_i = \gamma \sum_{i=2}^{N} \pi_i = \gamma (1 - \pi_1).$$

Assuming that the suspect will certainly match if he is in
fact the source of the crime stain, $\Pr(M_1 \mid H_1) = L_1 = 1$,
the posterior probability $\pi_1'$ that the suspect is the source
of the crime stain, after considering the evidence $M_1$, is
thus as follows:



$$\pi_1' = \Pr(H_1 \mid M_1) = \frac{\pi_1}{\pi_1 + \gamma (1 - \pi_1)} \tag{3}$$

Bayesian network for the island problem

The result from the previous section can be tracked in a
Bayesian network as shown in Figure 1(i).
This model contains the following elements:

1. Node N. This is a numeric node with
states 2, 10, 100, and 1,000 (other numbers may
obviously be chosen) and represents the size of the
suspect population, that is, the individuals who
could have left the crime stain.

2. Node H. This node has two states. The state $H_1$
represents the proposition 'The suspect is the source
of the crime stain'. The state $\bar{H}_1$ represents the
composite proposition 'One of the other N - 1
individuals is the source of the crime stain'. It is an
aggregation of all propositions $H_i$ (for i = 2, ..., N).
The probability table of node H contains the
probability $\pi_1 = 1/N$ for the state $H_1$
and $(N - 1)/N$ (which is equivalent to $1 - \pi_1$) for
the state $\bar{H}_1$ (see Table 1).

3. Node $\gamma$. This node contains numeric states that
represent the rate at which the corresponding
genetic feature appears in the population. For the
purpose of illustration, the values 0.01 and 0.1 are
chosen. Notice that this node is not strictly
necessary. It would also be possible to specify $\gamma$
directly in the probability table of the node $M_1$. A
representation of $\gamma$ in terms of a distinct node is
retained here in order to provide a detailed
decomposition of the problem at hand.

4. Node $M_1$. This node has two states, $M_1$ ('The
suspect's profile corresponds to that of the crime
stain') and $\bar{M}_1$ ('The suspect's profile does not
correspond to that of the crime stain'). If the suspect
is in fact the source of the crime stain (that is,
proposition $H_1$ holds), then the correspondence, $M_1$,
is assumed to occur with certainty (irrespective of
the rarity of the corresponding characteristic,
expressed by $\gamma$). Otherwise (that is, $\bar{H}_1$ being true),
the correspondence occurs as a function of the rate $\gamma$
with which the corresponding feature appears in the
population. The probability table of the node $M_1$
is thus completed as shown in Table 2.

Figure 1 Compact and expanded representations of a Bayesian
network for a one stain one offender case. (i) Formal outline of a
Bayesian network for evaluating a correspondence between the
profile of a crime stain and that of a sample from a suspect, according
to Equation 3. The setting relates to one in which the population of
potential offenders is of size N and either the suspect ($H_1$) or one of
the other N - 1 individuals ($\bar{H}_1$) is the source of the crime stain
(proposition H). The corresponding genetic feature occurs in the
population with rate $\gamma$. (ii) Evaluation of a situation in which the size of
the population is N = 100, $\gamma$ is 0.01, and the suspect's profile is found
to correspond to that of the crime stain ($M_1$). The posterior probability
that the suspect is the source of the crime stain, $\Pr(H_1 \mid M_1)$, is shown
in the node H. It takes the value 0.5025. Instantiated node states are
shown in bold, and probabilities are displayed in percentages.

Table 1 Probability table for node H

N:             2       10      100     1,000
$H_1$          0.5     0.1     0.01    0.001
$\bar{H}_1$    0.5     0.9     0.99    0.999

Conditional probabilities assigned to the states $H_1$ and $\bar{H}_1$ of the node H.

An important aspect of the current development is
that the scientific evidence is confined solely to the fact
that the suspect's profile is found to correspond with the
profile of the crime stain. Nothing is said about how the
remaining N - 1 individuals compare to the crime stain.

For the purpose of illustration, let us assume that the
size of the suspect population is N = 100, and the
rate $\gamma$ at which the corresponding genetic characteristic
occurs in the population is 0.01. According to
Equation 3, and assuming a prior probability of 1/N for
each of the N individuals, the probability that the stain
comes from the suspect is 0.01/(0.01 + 0.01 x (1 - 0.01)) =
0.5025. This result can also be found via the proposed
Bayesian network. A visual illustration of this is given in
Figure 1(ii). The instantiated nodes (that is, nodes set to the
state 'known') are shown in bold. The target probability,
$\Pr(H_1 \mid M_1)$, is displayed in the node H.
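
The numerical illustration above can be retraced with a few lines of
Python. The following is only a minimal sketch under stated assumptions:
the function names are hypothetical and do not come from the paper, exact
fractions are used to avoid rounding effects, and the 'network' is reduced
to an enumeration of the two states of node H with the probabilities of
Tables 1 and 2. It is not a substitute for Bayesian network software.

```python
from fractions import Fraction


def island_posterior(N, gamma):
    """Equation 3: Pr(H_1 | M_1) with a uniform prior pi_1 = 1/N."""
    prior = Fraction(1, N)
    return prior / (prior + gamma * (1 - prior))


def network_posterior(N, gamma):
    """Same target, obtained by enumerating the states of node H.

    Illustrative stand-in for the network: priors as in Table 1,
    match probabilities as in the M_1 row of Table 2.
    """
    prior_h = {"H1": Fraction(1, N), "not_H1": Fraction(N - 1, N)}
    pr_match = {"H1": Fraction(1), "not_H1": gamma}
    joint = {h: prior_h[h] * pr_match[h] for h in prior_h}
    return joint["H1"] / sum(joint.values())


gamma = Fraction(1, 100)
print(round(float(island_posterior(100, gamma)), 4))   # 0.5025
print(island_posterior(100, gamma) == network_posterior(100, gamma))  # True
```

Both routes give 100/199, that is, about 0.5025, which is the value
reported for the node H.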



When some islanders are in a database 
Formal analysis 

The island problem as described in the previous section
is now slightly modified. It will still be assumed that the
variable N represents the size of the total population.
However, the analysis will suppose that the DNA profiles
of the individuals 1, ..., n (where index 1 is that of the
suspect) are in a database. The individuals n + 1, ..., N are
outside the database. It is also assumed in this
scenario that the profile of the crime stain is compared
to the profiles of all n individuals in the database. This
search reveals that only the profile of the suspect
corresponds to the profile of the crime stain. This
correspondence is denoted, as before, by $M_1$. In addition,
the database search has revealed that the individuals
2, ..., n in the database other than the suspect do not match.
The fact that the profile of
an individual $i$ (for i = 2, ..., n) does not correspond to
the crime stain is denoted here by $X_i$. We can thus write
$X_2 \,\&\, X_3 \,\&\, \ldots \,\&\, X_n$ for the information that all entries of the
database other than that of the suspect do not correspond.
These two items of evidence need to be jointly evaluated,
so let us write, following [18], the totality of the
evidence as $E_n = M_1 \,\&\, X_2 \,\&\, X_3 \,\&\, \ldots \,\&\, X_n$.

Considering that n of the N individuals are in a
database leads to a minor refinement in the way in which
the source level propositions $H_i$ (for i = 2, ..., N) are
formulated. In fact, they can now be framed as 'the individual
i in the database is the source of the crime stain'.
A more conceptual underpinning of the latter propositions
is that they refer to individuals who had their DNA
profile compared to that of the crime stain. This is a
difference with respect to the individuals n + 1, ..., N,
whose profiles were not compared. On the whole, one
can thus think of the population of size N as split
into n individuals who are database members and N - n who
are not. This split becomes apparent when rewriting
the posterior probability defined earlier in Equation 1.
Writing this probability for the evidence $E_n$ gives
the following:

$$\Pr(H_1 \mid E_n) = \frac{\Pr(E_n \mid H_1)\Pr(H_1)}{\Pr(E_n \mid H_1)\Pr(H_1) + \sum_{i=2}^{n} \Pr(E_n \mid H_i)\Pr(H_i) + \sum_{i=n+1}^{N} \Pr(E_n \mid H_i)\Pr(H_i)} \tag{4}$$

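Alternatively, invoking the abbreviated notation, this
formula takes the following form:

$$\pi_1' = \Pr(H_1 \mid E_n) = \frac{L_1 \pi_1}{L_1 \pi_1 + \sum_{i=2}^{n} L_i \pi_i + \sum_{i=n+1}^{N} L_i \pi_i} \tag{5}$$

Table 2 Probability table for node $M_1$

H:             $H_1$           $\bar{H}_1$
$\gamma$:      0.01    0.1     0.01    0.1
$M_1$          1       1       0.01    0.1
$\bar{M}_1$    0       0       0.99    0.9

Conditional probabilities assigned to the states $M_1$ and $\bar{M}_1$ of the node $M_1$.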
Since it is still assumed here that the initial probabilities
$\Pr(H_i)$ are given by 1/N, it becomes relevant to draw
attention to the likelihoods $\Pr(E_n \mid H_i)$ because they will
determine whether or not the posterior probability of $H_1$
given $E_n$ (Equation 4) is different from the posterior
probability of $H_1$ knowing only the match of the suspect, $M_1$
(Equation 1), and nothing about the matching status of all
the individuals other than the suspect.

Consider the following:

1. $\Pr(E_n \mid H_1)$. This term represents the probability that
the suspect's profile corresponds to that of the crime
stain and that none of the other n - 1 members of
the database correspond, given that the suspect is the
source of the crime stain. The suspect is assumed to
match with certainty if he is in fact the source, whereas
each of the n - 1 individuals may correspond with
probability $\gamma$. The probability that none of the latter
individuals corresponds is thus $(1 - \gamma)^{n-1}$. We can
thus write $\Pr(E_n \mid H_1) = 1 \times (1 - \gamma)^{n-1}$, or
$L_1 = (1 - \gamma)^{n-1}$ for short.

2. $\Pr(E_n \mid H_i)$, for i = 2, ..., n. This term represents the
likelihood for the other n - 1 individuals in the
database. Clearly, given the stated assumptions about
the reliability of the DNA typing technique, one
would expect to have a match among the n - 1
individuals in the database if the true source is
among them. Therefore, the probability of
observing $E_n$, that is, a match with the suspect but
with none of the other n - 1 database members, is
zero: $L_i = 0$ for i = 2, ..., n.

3. $\Pr(E_n \mid H_i)$, for i = n + 1, ..., N. This term represents
the likelihood for each individual outside the
database. If one of the individuals i = n + 1, ..., N is
the source of the crime stain, then the suspect may
match with probability $\gamma$, and all members of the
database other than the suspect will not match with
probability $(1 - \gamma)^{n-1}$. Therefore, the likelihood
$L_i$ for each individual i = n + 1, ..., N
is $\gamma (1 - \gamma)^{n-1}$.

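Equation 5 thus changes to become the following:

$$
\pi_1' = \Pr(H_1 \mid E_n)
= \frac{L_1\pi_1}{L_1\pi_1 + \underbrace{\sum_{i=2}^{n} L_i\pi_i}_{0} + \sum_{i=n+1}^{N} L_i\pi_i}
= \frac{(1-\gamma)^{n-1}\pi_1}{(1-\gamma)^{n-1}\pi_1 + \sum_{i=n+1}^{N}\gamma(1-\gamma)^{n-1}\pi_i}. \qquad (6)
$$

In the denominator, the constant $\gamma(1-\gamma)^{n-1}$ can be
taken out of the sum. In addition, $(1-\gamma)^{n-1}$ cancels in
both the numerator and the denominator. This leaves one
with the following:

$$
\pi_1' = \Pr(H_1 \mid E_n) = \frac{\pi_1}{\pi_1 + \gamma\sum_{i=n+1}^{N}\pi_i}. \qquad (7)
$$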

The logic of this result is that the second term in
the denominator, $\gamma\sum_{i=n+1}^{N}\pi_i$, is smaller than
$\gamma(1-\pi_1)$ in Equation 3. This latter expression involves a
sum of prior probabilities over the entire population (with no
one except the suspect being in the database) minus the
suspect. The former, in Equation 7, involves only a sum
over those members of the population which are not
in the database. Stated otherwise, the prior probabilities
for the individuals in the database which are found to
have profiles different from that of the crime stain cancel
because of the multiplication with the zero likelihood.
Because of a smaller denominator, the posterior probability
$\pi_1'$ in Equation 7 turns out to be greater than that in
Equation 3. The selection of a suspect in a database along
with an exclusion of other database members by DNA
evidence thus provides more evidence against the matching
suspect.
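The comparison between Equations 3 and 7 can be checked numerically. The following is a minimal Python sketch under the assumptions stated in the text (uniform priors 1/N, error-free typing). The function names are illustrative only, and the form used for Equation 3, which is not reproduced in this section, is inferred from the reference to the term $\gamma(1-\pi_1)$ above.

```python
def posterior_island(N, gamma):
    """Equation 3 (form inferred from the text): the suspect matches and
    nobody else in the population has been examined."""
    pi1 = 1.0 / N
    return pi1 / (pi1 + gamma * (1.0 - pi1))


def posterior_database(N, n, gamma):
    """Equation 7: the suspect matches and the other n - 1 database
    members are found not to correspond."""
    pi1 = 1.0 / N
    # Summed prior probability of the N - n individuals outside the database.
    outside = (N - n) / N
    return pi1 / (pi1 + gamma * outside)


# Setting used in the text: N = 100, gamma = 0.01, database of size n = 2.
print(posterior_island(100, 0.01))
print(posterior_database(100, 2, 0.01))
```

With n = 1 the two functions coincide, and with n = N the second returns 1, which mirrors the limiting cases discussed in the text.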

Bayesian network for a database search setting: suspect and 
one other individual in the database 

The Bayesian network earlier described in Figure 1 can 
serve as a starting point for extending analyses to sit- 
uations involving the search of a database. In order to 
point this out in a stepwise procedure, let us start with 
a situation in which there are only two individuals in the 
database {n = 2), the suspect and one other person. The 
following modifications are introduced in the graphical 
model (see also Figure 2): 

1. Node H. A distinct proposition $H_2$ is introduced. It
refers to the proposition according to which
individual 2 - the second individual on the database
besides the suspect - is the source of the crime stain.
As before (section 'Bayesian network for the island
problem'), the proposition $H_1$ states that the suspect
(that is, the individual indexed as 1) is the source of
the crime stain. The previous proposition $\bar{H}_1$,
accounting for all individuals in the population of
size N except the suspect, is modified to $H_{3\_N}$. This
latter proposition specifies that the true source is
among the $N-n$ individuals outside the database (as
noted above, n is set to 2 for the time being). The
probability table of the node H completes as follows
(n = 2):

$$
\Pr(H_1 \mid N) = \Pr(H_2 \mid N) = 1/N, \qquad \Pr(H_{3\_N} \mid N) = (N-n)/N.
$$

It is still assumed that, initially, each member of the
population of size N has the same probability of
being the source of the crime stain.

Figure 2 Bayesian network for assessing a single database 'hit'.
Structure of a Bayesian network for evaluating a correspondence
between the profile of a crime stain and that of a sample from a
suspect when the suspect is on a database along with $n-1$ other
individuals whose DNA profiles do not correspond. The size of the
population of potential offenders is N. Among the N individuals, n
(with n < N) are on a database. The node H has three states: 'the
suspect is the source of the crime stain' ($H_1$), 'the second individual in
the database is the source of the crime stain' ($H_2$), and 'the source of
the crime stain is among the $N-n$ (here, n = 2) individuals outside
the database' ($H_{3\_N}$). The corresponding genetic feature occurs in the
population with rate $\gamma$. The node $X_2$ is binary and represents the
proposition according to which the profile of individual 2 (in the
database) does not correspond to the crime stain.

2. Node $X_2$. This is a newly introduced binary node
with states $X_2$, defined as 'the profile of individual 2
in the database does not correspond to the crime
stain profile', and $\bar{X}_2$, defined as 'the profile of
individual 2 corresponds to that of the crime stain'.
For situations in which individual 2 is not the source
of the crime stain, the probability that it will
nevertheless be found to correspond depends on the
rarity of the characteristic. Therefore, node $X_2$
depends on the node $\gamma$. The probability table for the
node $X_2$ completes as shown in Table 3.

3. Node $M_1$. The definition of this node is the same as
that given earlier in the section 'Bayesian network for
the island problem'. However, an extension of the
probability table is necessary because of the modified
states of the node H. This is shown in Table 4.

Table 3 Probability table for node $X_2$

H:                $H_1$             $H_2$             $H_{3\_N}$
$\gamma$:         0.01    0.1       0.01    0.1       0.01    0.1
$X_2$             0.99    0.9       0       0         0.99    0.9
$\bar{X}_2$       0.01    0.1       1       1         0.01    0.1

Conditional probabilities assigned to the states $X_2$ and $\bar{X}_2$ of the node $X_2$.

In order to investigate the properties of the proposed
Bayesian network, consider again a setting in which the
population of potential sources is of size N = 100, and the
rarity of the crime stain genotype is $\gamma = 0.01$. Introducing
the evidence $M_1$, that is, a correspondence between
the DNA profile of the suspect and that of the crime stain,
changes the prior probability $\Pr(H_1) = 1/N = 0.01$
into a posterior probability of $\Pr(H_1 \mid M_1) = 0.5025$. This
is a result found earlier in the 'Bayesian network for the
island problem' section. As shown in Figure 3i, the
calculations in the Bayesian network constructed in this section
lead to the same finding.

At this point, nothing has been communicated yet to 
the Bayesian network about whether or not the second 
individual on the database, besides the suspect, has a cor- 
responding profile. Notwithstanding, something can be 
said about the probability that the second individual in 
the database would match. As shown in Figure 3i, the 
probability that individual 2 would not match (that is, 
state X2 being true), given knowledge of Mi, is 0.985. 
The logic of this result can be derived from the Bayesian 
network. In fact, that probability is the sum of the prod- 
ucts of the conditional probabilities of X2 given each 
state of the node H and the actual probabilities of these 
latter states: 

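$$
\begin{aligned}
\Pr(X_2 \mid M_1) ={}& \Pr(X_2 \mid H_1)\Pr(H_1 \mid M_1) \\
&+ \Pr(X_2 \mid H_2)\Pr(H_2 \mid M_1) \\
&+ \Pr(X_2 \mid H_{3\_N})\Pr(H_{3\_N} \mid M_1). \qquad (8)
\end{aligned}
$$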

Given that individual 2 is taken to match with certainty
if that individual is in fact the source of the crime stain,
one has $\Pr(X_2 \mid H_2) = 0$. Consequently, the term in the
center of Equation 8 cancels. Under the remaining
propositions, individual 2 does not correspond with probability
$(1-\gamma)$. Using shorthand notation for the posterior
probabilities of H defined earlier in the text, Equation 8 becomes
the following:

$$
\begin{aligned}
\Pr(X_2 \mid M_1) &= (1-\gamma)\pi_1' + (1-\gamma)\pi_{3\_N}' \\
&= (1-\gamma)(\pi_1' + \pi_{3\_N}') \\
&= 0.99 \times (0.5025 + 0.4925) = 0.9850. \qquad (9)
\end{aligned}
$$

Table 4 Modified probability table for node $M_1$

H:                $H_1$             $H_2$             $H_{3\_N}$
$\gamma$:         0.01    0.1       0.01    0.1       0.01    0.1
$M_1$             1       1         0.01    0.1       0.01    0.1
$\bar{M}_1$       0       0         0.99    0.9       0.99    0.9

Conditional probabilities assigned to the states $M_1$ and $\bar{M}_1$ of the node $M_1$.

[Figure 3, node H values in percent. Panel (i): $H_1$ 50.25, $H_2$ 0.50, $H_{3\_N}$ 49.25. Panel (ii): $H_1$ 50.51, $H_2$ 0, $H_{3\_N}$ 49.49.]

Figure 3 Expanded representations of a Bayesian network for assessing a single database 'hit'. Bayesian network (with nodes shown in
expanded form) for evaluating a correspondence between the profile of a suspect and that of a crime stain, as defined in Figure 2. Fixed node states
are shown in bold. The network (i) shows an evaluation of the information that the suspect's profile is found to correspond ($M_1$ = true) when
N = 100 and $\gamma = 0.01$. The posterior probability that the suspect is the source of the crime stain is shown by the state $H_1$ in the node H. The
network (ii) shows a situation in which the additional information about the second (non-matching) individual on the database is known.
Probabilities are shown in percentages.



As a next step in analyzing the proposed Bayesian net- 
work, one can consider the incorporation of knowledge 
about individual 2. For the purpose of the current discus- 
sion, assume that this person is found not to correspond. 
This amounts to considering X2 to be true. Introducing 
this information into the Bayesian network leads to the 
result shown in Figure 3ii. As may be seen, the probabil- 
ity that the suspect is the source of the crime stain has 
increased from 0.5025 to 0.5051. This latter result corre- 
sponds to that which is obtained by applying Equation 7. 

The Bayesian network discussed here provides a means
to make plain the changes in the source level propositions
H through the consideration of the result of a
database search. By saying that individual 2 does not
correspond, $H_2$ is 'falsified': as can be seen in Figure 3ii, the
state $H_2$ of the node H now has a zero probability. As
a logical implication, the probability previously assumed
by this state must be 'redistributed' among the remaining
propositions $H_1$ and $H_{3\_N}$, and this explains why their
probabilities change in the described way.
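The values read from Figure 3 can be reproduced without dedicated Bayesian network software by enumerating the joint distribution of the three nodes H, $M_1$ and $X_2$. The following Python sketch is a minimal illustration under the assumptions of this section (n = 2, uniform priors, the node $\gamma$ fixed at a single value). The function and variable names are hypothetical and do not come from the paper.

```python
def joint(N, gamma):
    """Joint table {(h, m1, x2): probability} for the network of Figure 2
    with n = 2 and the rarity node fixed at gamma."""
    prior = {"H1": 1 / N, "H2": 1 / N, "H3_N": (N - 2) / N}
    # Pr(M1 = true | H): the suspect matches for sure if he is the source.
    p_m1 = {"H1": 1.0, "H2": gamma, "H3_N": gamma}
    # Pr(X2 = true | H): X2 states that individual 2 does NOT correspond.
    p_x2 = {"H1": 1 - gamma, "H2": 0.0, "H3_N": 1 - gamma}
    table = {}
    for h in prior:
        for m1 in (True, False):
            for x2 in (True, False):
                pm = p_m1[h] if m1 else 1 - p_m1[h]
                px = p_x2[h] if x2 else 1 - p_x2[h]
                table[(h, m1, x2)] = prior[h] * pm * px
    return table


def query(table, target, evidence):
    """Pr(target | evidence); both are predicates on a key (h, m1, x2)."""
    num = sum(p for k, p in table.items() if target(k) and evidence(k))
    den = sum(p for k, p in table.items() if evidence(k))
    return num / den


table = joint(N=100, gamma=0.01)
# Equation 9: probability that individual 2 does not correspond, given M1.
p_x2_given_m1 = query(table, lambda k: k[2], lambda k: k[1])
# Figure 3i and 3ii: posterior of H1 before and after learning X2.
p_h1_given_m1 = query(table, lambda k: k[0] == "H1", lambda k: k[1])
p_h1_given_both = query(table, lambda k: k[0] == "H1", lambda k: k[1] and k[2])
print(p_x2_given_m1, p_h1_given_m1, p_h1_given_both)
```

Brute-force enumeration is only practical here because the network has three small nodes; it is used to keep the sketch free of external dependencies, not because it is how the figures in the paper were produced.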



A reverse analysis of the database search problem 

The analysis of the currently discussed Bayesian network
has made it possible to point out two known aspects of the
database search issue:

1. One aspect is that information about the result of a 
database search represents an additional item of 
evidence. 

2. A second aspect is that information about non- 
matching individuals in a database tends to increase 
the strength of the evidence against the suspect. 

As pointed out at the end of the previous section, the 
logic of the strengthened evidence against a matching sus- 
pect can be understood by considering that the circle of 
potential suspects is reduced when finding non-matching 
individuals. 

In order to illustrate these ideas further,
one can rely on the fact that the final result of applying
Bayes' theorem is invariant to the order of sequentially
applied items of evidence. Consider this in terms
of a particular example in which the true source of the
crime stain is among only three persons (that is, N = 3)
and the suspect is one of them. Consequently, one has the
three propositions $H_1$, $H_2$ and $H_3$ with initial probabilities
$\pi_i = 1/N = 1/3$ (for $i = 1, 2, 3$). Assume further,
as before, that two individuals are in a database, that is,
the suspect and one other person (thus, n = 2). That
other person, individual 2, has a DNA profile that does not
correspond to that of the crime stain. This information
is denoted as $X_2$. It is possible to calculate the posterior
probability that the suspect is the source of the crime
stain given the 'sole' information that individual 2 does
not correspond. Let us write this (intermediate) posterior
probability as $\pi_1^* = \Pr(H_1 \mid X_2)$. It is obtained as follows:

$$
\pi_1^* = \Pr(H_1 \mid X_2) = \frac{\Pr(X_2 \mid H_1)\Pr(H_1)}{\Pr(X_2 \mid H_1)\Pr(H_1) + \Pr(X_2 \mid H_2)\Pr(H_2) + \Pr(X_2 \mid H_3)\Pr(H_3)}. \qquad (10)
$$

Under $H_2$, it is not possible that $X_2$ is true. Therefore,
the term in the center of the denominator cancels. Given
that the other likelihoods $L_i$ (for $i = 1, 3$) are equal, as
well as the prior probabilities $\pi_i$ (for $i = 1, 3$), this leaves
one with the following:

$$
\pi_1^* = \Pr(H_1 \mid X_2)
= \frac{\Pr(X_2 \mid H_1)\Pr(H_1)}{\Pr(X_2 \mid H_1)\Pr(H_1) + \Pr(X_2 \mid H_3)\Pr(H_3)}
= \frac{L_1\pi_1}{L_1\pi_1 + L_3\pi_3}
= \frac{(1-\gamma)\pi_1}{2(1-\gamma)\pi_1} = 0.5. \qquad (11)
$$

The initial probability that the suspect is the source of
the crime stain has thus increased from 1/3 to 1/2. This is
an expression of the redistribution of probability among
two instead of three individuals who are equally likely to
be the source of the crime stain.

To some extent, this inference problem is comparable to
the Monty Hall puzzle, also known as 'Let's make a deal',
a televised American game show hosted by Monty Hall.
In that game, the contestant will learn about which of the
three doors does not hide a prize. Based upon this
information, the contestant is concerned with re-evaluating
the probability with which the remaining two doors hide
the prize.

As a next step, one can add the information about the
correspondence between the suspect's profile and that of
the crime stain, $M_1$. The intermediate posterior probability
of $H_1$ given knowledge about the non-matching
individual 2, $X_2$, provides the 'new prior' for this. Assuming
independence between $X_2$ and $M_1$ given H, Bayes'
theorem can be written as follows:

$$
\pi_1' = \Pr(H_1 \mid X_2, M_1)
= \frac{\Pr(M_1 \mid H_1)\Pr(H_1 \mid X_2)}{\Pr(M_1 \mid H_1)\Pr(H_1 \mid X_2) + \Pr(M_1 \mid H_3)\Pr(H_3 \mid X_2)}
= \frac{\Pr(M_1 \mid H_1)\pi_1^*}{\Pr(M_1 \mid H_1)\pi_1^* + \Pr(M_1 \mid H_3)\pi_3^*}. \qquad (12)
$$

The suspect will certainly be found to correspond
under $H_1$, whereas under $H_3$, he will do so with
probability $\gamma$. Given that $\pi_1^* = \pi_3^* = 0.5$ from
Equation 11, the posterior $\pi_1'$ can be found to be
$0.5/(0.5 + \gamma \times 0.5) = 0.990099$.

The same result is obtained when applying both $M_1$
and $X_2$ to the $\pi_1 = 1/3$ prior in a single step. In fact,
using $E_2 = \{M_1, X_2\}$ in Equation 6 with $\pi_1 = \pi_3 = 1/3$
leads to the following:

$$
\pi_1' = \Pr(H_1 \mid E_2) = \frac{L_1\pi_1}{\sum_{i=1}^{3} L_i\pi_i}
= \frac{(1-\gamma)\pi_1}{(1-\gamma)\pi_1 + \gamma(1-\gamma)\pi_3} = 0.990099. \qquad (13)
$$

These results can also be tracked within the currently
discussed Bayesian network. Figure 4 shows the starting
point that is characterized by the population of size N = 3
and the rarity $\gamma = 0.01$ of the corresponding genetic trait.
Initially, the probability that the suspect will be found to
correspond is given by the following:

$$
\begin{aligned}
\Pr(M_1) &= \Pr(M_1 \mid H_1)\Pr(H_1) + \Pr(M_1 \mid H_2)\Pr(H_2) + \Pr(M_1 \mid H_3)\Pr(H_3) \\
&= 1 \times \pi_1 + \gamma\pi_2 + \gamma\pi_3 = 1/3 + (2/3)\gamma = 0.34.
\end{aligned}
$$

The probability that individual 2 will not correspond,
$X_2$, is also given by the logic of the 'extension of the
conversation':

$$
\begin{aligned}
\Pr(X_2) &= \Pr(X_2 \mid H_1)\Pr(H_1) + \Pr(X_2 \mid H_2)\Pr(H_2) + \Pr(X_2 \mid H_3)\Pr(H_3) \\
&= (1-\gamma)\pi_1 + 0 \times \pi_2 + (1-\gamma)\pi_3 = (2/3)(1-\gamma) = 0.66.
\end{aligned}
$$

Figure 4ii shows the state of the Bayesian network
after consideration of the fact that individual 2 does
not correspond to the crime stain. This changes the
1/N = 1/3 prior for $\pi_1$ to $\pi_1^* = 0.5$, as found through
Equation 11. Accordingly, the probability of finding the
suspect to correspond, $M_1$, increases to the following:

$$
\Pr(M_1 \mid X_2) = \Pr(M_1 \mid H_1)\pi_1^* + \Pr(M_1 \mid H_3)\pi_3^* = 1 \times 0.5 + \gamma \times 0.5 = 0.505.
$$

[Figure 4, node values in percent. Panel (i): H 33.33 / 33.33 / 33.33, $M_1$ 34.00, $X_2$ 66.00. Panel (ii): H 50.00 / 0 / 50.00, $M_1$ 50.50. Panel (iii): H 99.01 / 0 / 0.99.]

Figure 4 Expanded representations of a Bayesian network for assessing a single 'hit' in a database of reduced size. Bayesian network (with
nodes shown in expanded form) for evaluating a correspondence between the profile of a suspect and that of a crime stain, as defined earlier in
Figure 2. Fixed node states are shown in bold. The network (i) represents an initial situation in which the size of the population is N = 3, and the
corresponding characteristic occurs with probability $\gamma = 0.01$. (ii) The state of the Bayesian network after introducing information about the
non-matching individual 2 (that is, $X_2$). (iii) The state of the Bayesian network after adding the information that the suspect's profile corresponds to
that of the crime stain (that is, $M_1$). Probabilities are shown in percentages.



A last step then consists in adding the information that
the suspect corresponds, $M_1$. This is shown in Figure 4iii.
In this figure, the node H displays the posterior probability
$\Pr(H_1 \mid X_2, M_1) = \pi_1' = 0.9901$, which agrees with the
finding of Equation 13.
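The invariance of the final posterior to the order in which $X_2$ and $M_1$ are applied can also be verified with a few lines of Python. The sketch below assumes the N = 3, n = 2, $\gamma = 0.01$ setting of this section; the helper `update` and the dictionary names are illustrative and not part of the paper.

```python
def update(prior, likelihood):
    """One application of Bayes' theorem over a dict of propositions."""
    unnormalised = {h: prior[h] * likelihood[h] for h in prior}
    total = sum(unnormalised.values())
    return {h: p / total for h, p in unnormalised.items()}


gamma = 0.01
prior = {"H1": 1 / 3, "H2": 1 / 3, "H3": 1 / 3}
# Likelihoods of X2 ('individual 2 does not correspond') under each proposition.
lik_x2 = {"H1": 1 - gamma, "H2": 0.0, "H3": 1 - gamma}
# Likelihoods of M1 ('the suspect corresponds') under each proposition.
lik_m1 = {"H1": 1.0, "H2": gamma, "H3": gamma}

after_x2 = update(prior, lik_x2)                     # Equation 11
x2_then_m1 = update(after_x2, lik_m1)                # Equation 12
m1_then_x2 = update(update(prior, lik_m1), lik_x2)   # reverse order

print(after_x2["H1"], x2_then_m1["H1"], m1_then_x2["H1"])
```

Both orderings multiply the same prior by the same two likelihoods before normalising, which is why the final values agree.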

Bayesian network for a search of a database of size n > 2 

So far in this paper, the discussion of Bayesian networks 
has focused on situations in which there was no database 
(that is, the 'island problem') or a database with only two 
entries (that is, the suspect and one other individual). This 
way of presentation allows one to point out the logic of the 
approach in situations where the results are immediately 
compelling. The proposed Bayesian network procedure 
can however be extended to arbitrary numbers of N (that 
is, size of suspect population) and n (that is, database size), 
with N > n. Hereafter, this is outlined in some further 
detail. 

Figure 5 represents a generalization of the Bayesian
network shown in Figure 2 to situations in which the size
of the database n is greater than 2 (with n < N). The
following modifications are introduced:

1. The size of a database is modeled explicitly in terms
of a distinct node n with exemplary numerical
states 2, 10, 100 (other database sizes n < N may
obviously be chosen).

2. The node H has three states. $H_1$ represents the
proposition according to which the suspect is the
source of the crime stain. The proposition according
to which one of the individuals $2, \dots, n$ is the source of
the crime stain is represented by the state $H_{2\_n}$. The
third state is $H_{n+1\_N}$. It represents the proposition
that one of the $N-n$ individuals outside the database
is the source of the crime stain. Assuming again prior
probabilities of 1/N for each of the N individuals, the
following node probabilities are specified:

$$
\Pr(H_1) = 1/N, \qquad \Pr(H_{2\_n}) = (n-1)/N, \qquad \Pr(H_{n+1\_N}) = (N-n)/N.
$$

3. The probability table for the node $M_1$, the
proposition according to which the suspect's profile
corresponds, contains the following values:

$$
\Pr(M_1 \mid H_i, \gamma) =
\begin{cases}
1, & i = 1, \\
\gamma, & i \neq 1.
\end{cases}
$$

4. The node $X_2\&\dots\&X_n$ represents the proposition
according to which the $n-1$ individuals in the
database other than the suspect have profiles that do
not correspond to that of the crime stain. The node
probability table contains the following assignments:

$$
\Pr(X_2\&\dots\&X_n \mid H_i, \gamma) =
\begin{cases}
(1-\gamma)^{n-1}, & i = 1, \\
0, & i = 2, \dots, n, \\
(1-\gamma)^{n-1}, & i = n+1, \dots, N.
\end{cases}
$$

Figure 5 Bayesian network for a search of a database of size
n > 2. Structure of a Bayesian network for evaluating a
correspondence ($M_1$) between the profile of a crime stain and that of
a sample from a suspect when the suspect is on a database along
with $n-1$ other individuals. The size of the population of potential
offenders is N where n (with n < N) of them are on a database. The
node H has three states: 'the suspect is the source of the crime stain'
($H_1$), 'one of the $n-1$ other individuals in the database is the source
of the crime stain' ($H_{2\_n}$), and 'the source of the crime stain is among
the $N-n$ individuals outside the database' ($H_{n+1\_N}$). The
corresponding genetic feature occurs in the population with rate $\gamma$.
The node $X_2\&\dots\&X_n$ is binary and represents the proposition according
to which the profiles of the $n-1$ individuals in the database, other
than the suspect, do not correspond to the crime stain.

Figure 6 provides a graphical illustration of the Bayesian
network described in this section. In Figure 6i, the initial
situation is one with the database of size n = 100, which
equals the size of the population of potential offenders,
N. As a definitional implication of this, the prior probability
for the suspect being the source of the crime stain
is 1/N = 0.01 and that for the $n-1$ individuals in the
database other than the suspect is given by the complement,
$(N-1)/N = 0.99$. Because there are no potential
sources outside the database, the initial probability of the
proposition $H_{n+1\_N}$ is zero. Figure 6ii illustrates the effect
of learning that none of the individuals $2, \dots, n$ has a profile
matching that of the crime stain. This has two logical
consequences. Firstly, the proposition $H_{2\_n}$ must be false.
Secondly, probability must thus be 'redistributed' among
the remaining 'possible' propositions. The proposition $H_1$
is the only one of this kind. It thus assumes probability 1.
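The grouped network of Figure 5 can be sketched in Python as follows, assuming the node probability tables given in items 2 to 4 above and a fixed value for $\gamma$. The function name, the state labels used as dictionary keys and the `observe_*` flags are illustrative choices, not notation from the paper.

```python
def grouped_posterior(N, n, gamma, observe_m1=False, observe_x=False):
    """Posterior over the three grouped states of node H.

    observe_m1: the suspect's profile corresponds (M1 is true).
    observe_x:  none of the other n - 1 database members corresponds
                (X2 & ... & Xn is true).
    """
    prior = {"H1": 1 / N, "H2_n": (n - 1) / N, "Hn+1_N": (N - n) / N}
    p_m1 = {"H1": 1.0, "H2_n": gamma, "Hn+1_N": gamma}
    none_match = (1 - gamma) ** (n - 1)
    p_x = {"H1": none_match, "H2_n": 0.0, "Hn+1_N": none_match}

    weights = {}
    for h, p in prior.items():
        w = p
        if observe_m1:
            w *= p_m1[h]
        if observe_x:
            w *= p_x[h]
        weights[h] = w
    total = sum(weights.values())
    return {h: w / total for h, w in weights.items()}


# Figure 6i: database covers the whole population (n = N = 100), no evidence yet.
print(grouped_posterior(100, 100, 0.01))
# Figure 6ii: all n - 1 other database members are excluded.
print(grouped_posterior(100, 100, 0.01, observe_x=True))
```

Setting n = 2 and both flags to true reproduces the two-person database case discussed earlier, which gives a simple consistency check between the grouped and the ungrouped networks.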



A Bayesian network guided derivation of the 'database 
search likelihood ratio' 

So far in this paper, the discussion has concentrated on
the evaluation of database search results given multiple
propositions. In fact, each individual i of the population
of potential sources (of size N) was considered in terms
of a distinct proposition $H_i$. In order to facilitate the
presentation, the propositions $H_i$ have been grouped: $H_1$ refers
to the suspect only, $H_{2\_n}$ refers to the $n-1$ individuals
in the database other than the suspect, and $H_{n+1\_N}$ refers
to the individuals outside the database. The consideration
of multiple propositions has directed the analysis to
posterior probabilities of the single proposition $H_1$ (that is, 'the
suspect is the source of the crime stain'). Likelihood ratios
have not primarily been addressed here because they
compare propositions in pairs. In the analysis of scientific
evidence, likelihood ratios play an important role, however,
so that it is desirable to include them in this discussion.

Figure 6 Expanded representations of a Bayesian network for a search of a database of size n > 2. Bayesian network shown in Figure 5 with
expanded representation of nodes. (i) A situation in which the size of the database n equals that of the suspect population N = 100. The rarity of
the corresponding characteristic is set to 0.01. (ii) The additional information about the $n-1$ non-matching individuals of the database is
introduced. Probabilities are shown in percentages.

A general likelihood ratio procedure for comparing
more than two propositions has been described, for example,
by [34]. It will be used hereafter to derive a likelihood
ratio with reference to the Bayesian network shown
in Figure 5. It starts by grouping the propositions $H_{2\_n}$
and $H_{n+1\_N}$ as the proposition $\bar{H}_1$. It represents the
proposition according to which the crime stain comes from
some other person than the suspect (either from some
other person in the database or from someone outside the
database). This proposition forms a pair along with $H_1$,
that is, 'the suspect is the source of the crime stain'.
Following considerations outlined in the section 'Bayesian
network for a search of a database of size n > 2',
let $X_2\&\dots\&X_n$ denote the evidence that none of the $n-1$
individuals in the database has a DNA profile corresponding
to that of the crime stain. Assuming that the prior
probabilities for each of the propositions $H_i$ can be given,
the ratio of the probability of the evidence given each
of the pair of propositions $H_1$ and $\bar{H}_1$, called database
likelihood ratio here ($LR_{db}$), can be evaluated as follows:

$$
LR_{db} = \frac{\Pr(X_2\&\dots\&X_n \mid H_1)}{\Pr(X_2\&\dots\&X_n \mid \bar{H}_1)}
= \frac{\Pr(X_2\&\dots\&X_n \mid H_1)\left\{\sum_{i=2}^{N}\Pr(H_i)\right\}}{\sum_{i=2}^{N}\Pr(X_2\&\dots\&X_n \mid H_i)\Pr(H_i)}. \qquad (14)
$$



The denominator of this expression can be extended as
follows:

$$
\sum_{i=2}^{N}\Pr(X_2\&\dots\&X_n \mid H_i)\Pr(H_i)
= \sum_{i=2}^{n}\Pr(X_2\&\dots\&X_n \mid H_i)\Pr(H_i)
+ \sum_{i=n+1}^{N}\Pr(X_2\&\dots\&X_n \mid H_i)\Pr(H_i).
$$

The first part of this sum cancels because the likelihoods
for the $n-1$ individuals in the database, other than the
suspect, are zero. The likelihood ratio, Equation 14, thus
reduces to the following:

$$
LR_{db} = \frac{\Pr(X_2\&\dots\&X_n \mid H_1)}{\Pr(X_2\&\dots\&X_n \mid \bar{H}_1)}
= \frac{\Pr(X_2\&\dots\&X_n \mid H_1)\left\{\sum_{i=2}^{N}\Pr(H_i)\right\}}{\sum_{i=n+1}^{N}\Pr(X_2\&\dots\&X_n \mid H_i)\Pr(H_i)}. \qquad (15)
$$

The likelihood for the suspect and the individuals outside
the database is $(1-\gamma)^{n-1}$. The prior probability for
each individual i to be the source of the crime stain is, as
it was assumed throughout this paper, 1/N. Equation 15
can thus be rewritten as follows:

$$
LR_{db} = \frac{(1-\gamma)^{n-1}\,\frac{N-1}{N}}{\sum_{i=n+1}^{N}(1-\gamma)^{n-1}\,\frac{1}{N}}
= \frac{(1-\gamma)^{n-1}\,\frac{N-1}{N}}{(1-\gamma)^{n-1}\,\frac{N-n}{N}}
= \frac{N-1}{N-n}. \qquad (16)
$$



According to this result, the likelihood ratio is maximal
when the size of the database, n, equals that of the population
of potential sources, N. The logic of this result is also
illustrated by the Bayesian network depicted in Figure 6ii.
It shows that knowledge of $X_2\&\dots\&X_n$ implies the truth
of $H_1$ in a setting in which N = n. Conversely, if the suspect
is the only person in the database (n = 1), this means
that there is no information about excluded individuals.
Accordingly, with n = 1, the value for $LR_{db}$ is one.
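The reduction from Equation 14 to Equation 16 can be checked exactly with rational arithmetic. The Python sketch below assumes uniform priors 1/N as in the text; the function names are illustrative, and $\gamma$ is passed as a string or a `Fraction` so that no rounding occurs.

```python
from fractions import Fraction


def lr_db_closed_form(N, n):
    """Equation 16: (N - 1) / (N - n); only defined for n < N."""
    return Fraction(N - 1, N - n)


def lr_db_from_sums(N, n, gamma):
    """Equation 14 evaluated term by term with uniform priors 1/N."""
    gamma = Fraction(gamma)
    prior = Fraction(1, N)
    none_match = (1 - gamma) ** (n - 1)
    # Likelihood of 'none of the other database members corresponds' under
    # each H_i, i = 2..N: zero for database members (i = 2..n) and
    # (1 - gamma)^(n - 1) for individuals outside the database (i = n+1..N).
    likelihoods = [Fraction(0)] * (n - 1) + [none_match] * (N - n)
    numerator = none_match * sum(prior for _ in likelihoods)
    denominator = sum(lik * prior for lik in likelihoods)
    return numerator / denominator


print(lr_db_closed_form(1000, 100))
print(lr_db_from_sums(1000, 100, "0.01"))
print(lr_db_from_sums(1000, 100, "0.1"))
```

Both functions raise `ZeroDivisionError` for n = N, which corresponds to the limiting case in which the denominator of the likelihood ratio is zero and $H_1$ is implied with certainty.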

The Bayesian network discussed so far (Figure 5) can
be adapted in order to illustrate a likelihood ratio
evaluation. As a minor modification, it is necessary to add a
summary node $H_1$ with two states $H_1$ ('the suspect is the
source of the crime stain') and $\bar{H}_1$ ('some person other
than the suspect is the source of the crime stain'). The
latter state regroups the two propositions $H_{2\_n}$ and $H_{n+1\_N}$
of the node H. The node $H_1$ is added as a descendant of the
node H. The probability table contains the logical values 0
and 1 as shown in Table 5.

This extension is shown in Figure 7. Figure 7i
shows an evaluation of the numerator of the likelihood
ratio. The node $H_1$ is set to $H_1$, which also implies
that the node H will display $H_1$ as 'true'. Because the rarity
of the characteristic $\gamma$ is set to 0.01 and the size of
the database n to 100, the probability that none of the
other $n-1$ individuals in the database has a corresponding
profile is $(1-\gamma)^{n-1} = 0.99^{99} = 0.3697$. This value is
shown in the node $X_2\&\dots\&X_n$.

The evaluation of the denominator is shown in
Figure 7ii. Here, the node $H_1$ is set to $\bar{H}_1$. This implies that
the state $H_1$ in the node H is zero. Accordingly, probability
is redistributed proportionally among the remaining
propositions $H_{2\_n}$ and $H_{n+1\_N}$. In fact, if the suspect is not
the source of the crime stain (that is, $\bar{H}_1$ is true), then (a)
there is a probability of $(n-1)/(N-1) = 99/999 = 0.0991$
that someone other than the suspect inside the database
is the source of the crime stain, and (b) there is a
probability of $(N-n)/(N-1) = 900/999 = 0.9009$ that
someone outside the database is the source of the crime
stain. These two probabilities are displayed in the node H.
Finally, the probability that none of the $n-1$ individuals in
the database (other than the suspect) matches, given that
the suspect is not the source of the crime stain, is given as
follows:

$$
\begin{aligned}
\Pr(X_2\&\dots\&X_n \mid \bar{H}_1) &= \Pr(X_2\&\dots\&X_n \mid H_{2\_n})\Pr(H_{2\_n} \mid \bar{H}_1) \\
&\quad + \Pr(X_2\&\dots\&X_n \mid H_{n+1\_N})\Pr(H_{n+1\_N} \mid \bar{H}_1) \\
&= 0 \times (n-1)/(N-1) + (1-\gamma)^{n-1} \times (N-n)/(N-1) \\
&= 0.99^{99} \times 900/999 = 0.3331.
\end{aligned}
$$

This result is obtained in the node $X_2\&\dots\&X_n$ in
Figure 7ii.
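The two values displayed in Figure 7 can be recomputed directly. The following short Python sketch uses the setting of that figure (N = 1,000, n = 100, $\gamma = 0.01$); the variable names are illustrative only.

```python
N, n, gamma = 1000, 100, 0.01

# Probability that none of the other n - 1 database members corresponds,
# given that the true source is not among them.
none_match = (1 - gamma) ** (n - 1)

# Numerator: the suspect is the source (H1 is true), Figure 7i.
numerator = none_match

# Denominator: the suspect is not the source; the remaining probability is
# split between the database members and the individuals outside it.
p_inside = (n - 1) / (N - 1)
p_outside = (N - n) / (N - 1)
denominator = 0.0 * p_inside + none_match * p_outside   # Figure 7ii

likelihood_ratio = numerator / denominator
print(numerator, denominator, likelihood_ratio)
```

The ratio of the two values agrees with $(N-1)/(N-n) = 999/900$ from Equation 16, since the factor $(1-\gamma)^{n-1}$ appears in both the numerator and the denominator.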

More generally, it is also worth noting that the evidential
value of 'excluding' individuals $2, \dots, n$ does not depend on
the rarity of the compared analytical characteristic $\gamma$ but
only on the size of the target population and the size of the
database. The evidence will be stronger or weaker depending
on whether the database covers, respectively, a greater
or a smaller proportion of the population.



Table 5 Probability table for summary node $H_1$

H:                $H_1$     $H_{2\_n}$     $H_{n+1\_N}$
$H_1$             1         0              0
$\bar{H}_1$       0         1              1

Logical values assigned to the states $H_1$ and $\bar{H}_1$ of the node $H_1$.

[Figure 7, node $X_2\&\dots\&X_n$ values in percent. Panel (i): 36.97 (true), 63.03 (false). Panel (ii): 33.31 (true), 66.69 (false).]

Figure 7 Alternative representations of a Bayesian network for a search of a database of size n > 2. Bayesian network shown in Figure 5 with
expanded representation of nodes along with an additional node $H_1$ with states $H_1$ ('the suspect is the source of the crime stain') and $\bar{H}_1$ ('some
person other than the suspect is the source of the crime stain (either someone else in the database or an individual outside the database)'). Both
figures show a situation in which the size n of the database equals 100 and that of the suspect population, N, equals 1,000. The rarity of the
corresponding characteristic is set to 0.01. (i) Illustration of the evaluation of the numerator of the likelihood ratio for the item of
information $X_2\&\dots\&X_n$ (that is, none of the $n-1$ individuals in the database other than the suspect corresponds to the crime stain): $H_1$ is set to 'true',
and the value of the numerator is shown in the node $X_2\&\dots\&X_n$. (ii) An evaluation of the denominator of the likelihood ratio (that is, $\bar{H}_1$ is set to 'true').
Probabilities are shown in percentages.



Conclusions 

Logically compelling arguments have been presented in the
scientific literature in support of the view that excluding
individuals in a database represents evidence that tends to
strengthen the case against a matching suspect [18,35]. It
is widely conceded, however, that the associated mathematics
is not easy to explain, in particular to lay persons,
and even more so in trial proceedings. It is therefore desirable
that, at least among forensic scientists and legal professionals,
there is a common and agreed understanding of
the proofs and logic that support the prevalent scientific
opinion in this area.

However, within the scientific community, this seems to
be a difficult endeavor. This is illustrated, for example, by
the critical debates that have at some point accompanied
the discussion of settings in which a suspect was selected
through a database search [14,36].
In some parts of the forensic community, opinions currently
persist according to which a database search should
weaken a case against the suspect. A recent example of
this is a recommendation issued by the German Stain
Commission [33]. That document falls for the known
misconception that it should be of concern that one is looking
for individuals that possess a profile that corresponds to
the crime stain. This is motivated by the intuitively appealing
but logically unfounded argument that (1) it is unsurprising
to see that the suspect who is found as a result of
a database search will present the target profile, and (2)
therefore, the corresponding crime stain profile ought to
be of little or reduced evidential value. It seems that such
an opinion is influenced by asking questions of the following
kind: 'What are the chances of finding an individual
that has the crime stain genotype if one is searching for
individuals who could possess that genotype (for example,
by searching a database)?' However, this is not a very helpful
question because it does not serve well the needs of the
recipients of expert information. They rather seek information
regarding a question of the following kind: 'Given
that a person was found with a profile that corresponds to
that of a crime stain, what is the strength of the evidence
against this suspect?'

As mentioned above, the principal routes of logical anal- 
ysis lead to the conclusion that the case against a matching 
suspect is strengthened when excluding other potential 
donors. This may be pointed out either through analy- 
ses of the posterior probability of the proposition that 
the suspect is the source of the crime stain or through 
a likelihood ratio analysis. The rigor of the analyses put
forward in the literature is also paired with convincing
implications in limiting cases. When all other potential sources
are excluded, the procedures indicate that the only
matching suspect must be the crime stain donor. When no
individuals other than the suspect are investigated, then 
the case against the suspect reduces to the evaluation of a 
one-stain one-offender case. Such a case may, within some 
general assumptions, be assessed in terms of the inverse 
of the random match probability [16]. 

Despite these entirely reasonable implications, both
widening the acceptance of these inferential procedures
and teaching them remain challenging tasks. This topic has
thus been made a principal aim of analysis and discussion
in this paper. The guiding ideas throughout were twofold.
Firstly, the database search problem was discussed as a
generalization of the island problem. Secondly, all
probabilistic analyses were systematically tracked within
graphical models (that is, Bayesian
networks). The merit of a methodology with a graphical 
support is that it allows one to point out that the various 
inferential procedures have common underlying patterns
of inference. Therefore, a Bayesian network approach is 
not only helpful for examining the logic within a given 
inferential procedure, but is also valuable for checking 
the coherence between different inferential approaches 
(here: the relationship between the island problem and the 
database search issue). 

More generally, starting with the island problem is
helpful because it is well posed. It is instructive to point out
the rationale of the argument in a 'simple world' context.
This can favor the understanding of the main principle of
the argument without possible distraction due to
particular numerical settings. The rationale for
searching among islanders is that any individual found
to have a profile other than that of the crime stain is 
excluded — under the assumption of absence of laboratory 
error — as a potential source. The pool of potential donors 
thus becomes smaller with the corollary that the suspicion 
against each remaining potential source must increase. 
Stated otherwise, probability is to be redistributed among 
fewer candidates. 

The Bayesian network approach discussed in this paper
provides a clear illustration for this. In Figure 8, a case
with a population size $N = 1{,}000$ and $n = 100$ is
considered, along with an analytical characteristic which
occurs with probability $\gamma = 0.01$. Knowing that the
individuals $2, \dots, n$ do not correspond to the crime stain
sets the probability of the proposition $H_{2,\dots,n}$, that one
of the $n - 1$ individuals of the database is the source of the
crime stain, to zero. Accordingly, probability must be redistributed
among the propositions $H_1$ ('the suspect is the source
of the crime stain') and $H_{n+1,\dots,N}$ ('the true source of the
crime stain is outside the database'). The question is then
how this redistribution ought to be carried out. If
one assumes that, initially, each individual $i$ had the
same probability of being the source of the crime stain,
then $\Pr(H_i) = 1/N$ (for $i = 1, \dots, N$). Next, if one
excludes the individuals $i = 2, \dots, n$, then the posterior
probability for the remaining individuals must reflect
a proportional increase. For example, if the proposition
for the suspect initially had a probability of $1/N$, it
has $1/(N - n + 1)$ after excluding the $n - 1$ individuals
in the database (other than the suspect). This is shown in
the Bayesian network in Figure 8 where, after consideration
of the evidence $X_2 \& \dots \& X_n$, the proposition $H_1$ has
the probability $1/(N - n + 1) = 0.00111$ and the proposition
$H_{n+1,\dots,N}$ has the probability $(N - n)/(N - n + 1) =
0.99889$.
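
The redistribution described above can be checked numerically. The following is a minimal sketch in Python, not the authors' Bayesian network: it groups the propositions into three states and applies Bayes' theorem using only the information that the database members 2, ..., n do not correspond. The function name `posterior_after_exclusions` and the state labels are illustrative assumptions, `gamma` stands for the probability of the analytical characteristic, and a uniform prior and the absence of laboratory error are assumed.

```python
from fractions import Fraction


def posterior_after_exclusions(N, n, gamma):
    """Posterior over grouped propositions after learning only that
    database members 2..n do not correspond to the crime stain.

    Assumes a uniform prior 1/N over all N potential sources and no
    laboratory error (a true source would always correspond).
    """
    prior_each = Fraction(1, N)
    # Probability that n - 1 non-source individuals all fail to correspond.
    all_non_matching = (1 - gamma) ** (n - 1)
    weights = {
        # Suspect is the source: the other n - 1 members are non-sources.
        "suspect": prior_each * all_non_matching,
        # A database member 2..n is the source: a non-match is impossible.
        "database_other": (n - 1) * prior_each * 0,
        # The source is one of the N - n individuals outside the database.
        "outside": (N - n) * prior_each * all_non_matching,
    }
    total = sum(weights.values())
    return {state: weight / total for state, weight in weights.items()}


post = posterior_after_exclusions(N=1000, n=100, gamma=Fraction(1, 100))
print(float(post["suspect"]))  # 1/901, about 0.00111
print(float(post["outside"]))  # 900/901, about 0.99889
```

In this sketch the term involving `gamma` cancels on normalization, so the result depends only on N and n. Calling the function with a larger n (for example n = 500) gives the suspect a larger share (1/501), which is the reinforcement argued for in the text.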

The graphical display in the proposed Bayesian network
is particularly compelling. If probability is taken from one
proposition (here: $H_{2,\dots,n}$), then it must 'go' somewhere
because, on the whole, the condition $\sum_{i=1}^{N} \Pr(H_i) = 1$
must remain satisfied. It is not conceivable that probability
is transferred exclusively to $H_{n+1,\dots,N}$, as suggested by
proponents of a decreasing probative value due to a database
search. The reason for this is that with increasing database
size $n$, the number of distinct propositions (that is,
individuals) subsumed under $H_{n+1,\dots,N}$ decreases. By all logic,
the proposition $H_1$ must thus be reinforced.

Figure 8 Bayesian network for assessing the effect of excluding individuals in a database. Bayesian network described earlier in Figure 7.
Illustration of a situation in which only the information about the $n - 1$ non-matching individuals in the database, other than the suspect, is
considered. The node H illustrates that this information leads to a posterior probability of 1 over the population size minus the excluded individuals
for the proposition that the suspect is the source of the crime stain: $\Pr(H_1 \mid X_2 \& \dots \& X_n) = 1/(N - n + 1) = 0.00111$. All probabilities are shown in
percentages. Further details are as given in the text.

A Bayesian network approach was pursued in this paper
because it has the advantage of offering a concise
representation and description of (1) the various components
(variables and probability assignments) that make up a
given inferential procedure as well as (2) their
relationships. From a purely descriptive point of view, the general
Bayesian network proposed here in Figure 5 allows one to
point out the following aspects:

1. The size of the database n and the size N of the
population of potential sources can be used to define
the distribution of probability among the competing
propositions of the node H.

2. The evidence in a database search scenario consists
of two distinct items of information. One of them is
the observed correspondence between the suspect's
profile and that of the crime stain. It depends on
whether the suspect is or is not the source of the
crime stain as well as on the rarity of the
corresponding analytical characteristic. Most
importantly, it does not depend directly on the size
of the database or the size of the population of
potential sources. A second item of information
pertains to the individuals in the database other than
the suspect, that is, the fact that these $n - 1$
individuals do not correspond to the profile of the
crime stain. This variable does depend on the size of
the database as well as the rarity of the analytical
characteristic.

3. The matching suspect and the non-matching
individuals in the database are, as is implied by the
network's graphical structure, distinct items of
evidence that are independent conditionally upon
knowledge of the target proposition H and the rarity
of the corresponding characteristic $\gamma$.
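
The three aspects above can be mirrored in a short enumeration. This is a sketch under stated assumptions (uniform prior 1/N, unrelated individuals, no laboratory error, individual 1 is the suspect, individuals 1, ..., n form the database), not a reproduction of the network in Figure 5. The function names `suspect_posterior` and `closed_form` are hypothetical, and the closed-form expression 1/(1 + (N - n)gamma) is what these assumptions yield algebraically.

```python
from fractions import Fraction


def suspect_posterior(N, n, gamma):
    """Pr(H_1 | suspect corresponds, individuals 2..n do not), by enumeration."""
    prior = Fraction(1, N)  # aspect 1: N fixes the prior over sources
    non_match_all = (1 - gamma) ** (n - 1)
    weights = []
    for source in range(1, N + 1):
        # Item 1: the suspect corresponds. Depends on H and gamma only.
        p_suspect_matches = Fraction(1) if source == 1 else gamma
        # Item 2: individuals 2..n do not correspond. Depends on H, n and gamma.
        p_others_excluded = Fraction(0) if 2 <= source <= n else non_match_all
        # Aspect 3: conditional independence given H and gamma, so multiply.
        weights.append(prior * p_suspect_matches * p_others_excluded)
    return weights[0] / sum(weights)


def closed_form(N, n, gamma):
    """Algebraic result of the same model under the assumptions above."""
    return 1 / (1 + (N - n) * gamma)


g = Fraction(1, 100)
print(suspect_posterior(1000, 100, g))   # database of 100 searched
print(suspect_posterior(1000, 1, g))     # no one but the suspect examined
print(suspect_posterior(1000, 1000, g))  # everyone else excluded
```

With n = 1 the sketch reduces to the one-stain one-offender case (a likelihood ratio of 1/gamma applied to prior odds of 1 to N - 1), and with n = N the suspect is the only remaining candidate.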

From a dynamic point of view, Bayesian networks allow 
their user to track probabilistic calculations in many dif- 
ferent ways. As pointed out throughout this paper, the 
proposed Bayesian network supports the calculation of 
both posterior probabilities as well as components of the 
likelihood ratio. For example, the user can investigate the 
effect of the two distinct items of evidence sequentially, 
irrespective of the order in which they are considered. An 
important implication of this is that reducing the pool of 
potential donors tends to strengthen the case against a 
suspect. In particular, this can be considered even before 
learning whether or not the suspect actually matches. 
Stated otherwise, knowledge of the 'matching status' of 
a suspect is not a necessary requirement for assessing 
the probative value of excluding other individuals in the 
database. 
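
The order-independence of the two items of evidence can be illustrated with a small sequential update. The sketch below is an assumption-laden illustration rather than the authors' model: it uses three grouped states, a uniform prior over N individuals, no laboratory error, and the example values N = 1,000, n = 100 and gamma = 0.01. The helper `update` and all variable names are hypothetical.

```python
from fractions import Fraction


def update(belief, likelihood):
    """One step of Bayes' theorem over a dict of mutually exclusive states."""
    unnormalized = {state: belief[state] * likelihood[state] for state in belief}
    total = sum(unnormalized.values())
    return {state: value / total for state, value in unnormalized.items()}


N, n, gamma = 1000, 100, Fraction(1, 100)

# Grouped prior: the suspect, the other n - 1 database members, everyone else.
prior = {
    "suspect": Fraction(1, N),
    "database_other": Fraction(n - 1, N),
    "outside": Fraction(N - n, N),
}
# Likelihood of the suspect corresponding under each state.
suspect_corresponds = {
    "suspect": Fraction(1),
    "database_other": gamma,
    "outside": gamma,
}
# Likelihood of individuals 2..n all failing to correspond under each state.
others_excluded = {
    "suspect": (1 - gamma) ** (n - 1),
    "database_other": Fraction(0),
    "outside": (1 - gamma) ** (n - 1),
}

match_first = update(update(prior, suspect_corresponds), others_excluded)
exclusions_first = update(update(prior, others_excluded), suspect_corresponds)
exclusions_only = update(prior, others_excluded)

print(match_first == exclusions_first)               # same posterior either way
print(prior["suspect"], exclusions_only["suspect"])  # before and after exclusions
```

The last line shows the point made in the text: the exclusions alone already raise the probability for the suspect, before anything is known about whether the suspect corresponds.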

All of these aspects offer valuable assistance in teach- 
ing. The authors currently rely on Bayesian networks as 
an approach to support and complement more formal 
learning material used within their institution. Both the
construction and subsequent analysis of Bayesian networks
with now widely available computer software are
found very helpful by students to learn about and further
the understanding of the mathematics that underlie
different evaluative procedures and legal problems in general
(as illustrated, for example, by the island problem).
Bayesian networks translate formal procedures within a 
graphical environment which can actively be explored 
by learners. This explanatory capacity makes Bayesian
networks particularly attractive for students who may find
purely algebraic approaches difficult to apprehend. More
generally, it is the hope of the authors that Bayesian
networks could also support practitioners in ongoing debates
by illustrating the logic of probabilistic solutions.

Endnotes 

^ In the UK, for example, the national database was
introduced in 1995 [37] on the basis of the Criminal Justice
and Public Order Act. As a further example, New Zealand
introduced its database in 1996 [38].
^ See also section 'Evidential value of "database hits": two
decades of debate'.

As a point of comparison, the Swiss database contains 
1.7% of the Swiss population, against 0.8% in Germany, 
1.8% in France, and 2.5% in the USA. 
^ The European Convention on Human Rights does not 
include the principle that courts have discretion to assess 
evidence, but it applies in every European country [39]. 
^ The Royal Statistical Society's Statistics and the Law 
Working Group is currently preparing a practitioner's 
guide on Bayesian networks and inferential reasoning. 
The group's first report, Guide No 1, on the fundamentals
of probability and statistical evidence in criminal
proceedings [8], was published in November 2010.
^ Let us recall that $L_i = 0$ for $i = 2, \dots,$
^ Individual 2 will not correspond with probability $(1 - \gamma)$
if either the suspect or individual 3 is the true source of 
the crime stain. 

^ The analogy drawn here is only that of making reference 
to the general kind of inference problem. In reality, the 
Monty Hall problem is slightly more subtle because it con- 
tains the additional information about the initial choice of 
a door by the contestant as well as the fact that the game 
presenter will not open the door chosen by the contestant 
nor that behind which the prize is. 

^ An assumption made here is that the individuals are 
unrelated. 